1 Introduction

Phylogentic trees. Everyone loves them. Particularly in the context of an outbreak they are great to look at the genomic relatedness (or not) between isolates collected from both clinical cases and environmental swabs.

Generally trees have been put together by bioinformaticians and scientists working in whole genome sequencing (WGS) laboratories. However, in their raw form they can be challenging to correctly interpret and infer relevant information. WGS alone is not a silver bullet regarding outbreak investigations. WGS data, and by extension phylogentic visulisation of this data, depends on, and is greatly enhanced by, contextual epidemiological information.

As such, there is a greater need for epidemiologists to be able to work with tree data to effectively communicate to stakeholders. Whether this be the progress of an outbreak, identifying suspected causative agents based on genomic relatedness, visualising contextual information for a pathogen for surveillance purposes, and many other reasons. There are a number of tools available to both view, manipulate and enhance tree data.

In this Lesson From the Field you will use the Interactive Tree of Life (iTOL - https://itol.embl.de/) to do some very quick visulisation, inspection and manipulation of trees. It is also possible to do annotations and apply metadata but this will be outside the scope of this LFF.

You will then take a subset of the tree looked at in iTOL and import it into an Rmarkdown document to do some visualisations. This will also serve as a bit of an overview of Rmarkdown and its utility both for analysis developement as well as documenting your work and being able to produce reports from this.

1.1 Learning objectives

  1. Understand the importance of contextual epidemiological information for whole genome sequencing analysis
  2. Use Interactive Tree Of Life to quickly view phylogenetic trees
    • Apply quick transformations and pruning
  3. Execute R code to read in, manipulate and produce phylogentic trees
    • Edit R code to customise outputs
    • Apply some metadata to a phylogenetic tree
  4. Achieve a brief introduction to rmarkdown and the pacman approach to package management

1.2 Overview and Resources

The majority of this LFF will be following along this webpage. Later you will run, edit, and write some R code in the provided .rmd file. Some introductory resources to this have also been provided in the following sections.

The files can either be found within the LFF folder. Or you can download them by clicking the links below (if they open as a text web page, right click and choose “save as”). This webpage is actually an Rmarkdown document hosted online. If you want to see the actual file, you can find it in the github repo this is hosted on - this is left as an exercise to the reader.

Later in this exercise we will create a subtree from the big tree. The link below is the produced subtree if you are unable to create it and get stuck.

1.3 RStuidio Setup

Some minor set up will be required. This will also serve as a brief introduction to rmarkdown as well as the pacman package, which can serve as a relatively straightforward way to facilitate a degree code transferability.

You have “r file to be entered” but before that we do some set up in the R console.

Open RStudio

First we will install the pacman package (https://cran.r-project.org/package=pacman). pacman is basically a wrapper for a lot of package management tasks. The most useful one I have found is the p_load function, which says to load these libraries, and if they are not installed, to install them. If you work in a way where everybody uses pacman then you don’t have to worry about what packages you need to install for someones script as they will happen automatically when you run it.

The only requirement is that you have this package installed yourself first and that you need to load it first in your code as normal.

Run the following in your RStudio command line.

install.packages("pacman")

Now you can have a convention that when you write a script you start with your code like the below example:

library(pacman) # load the pacman package

# The following then loads (and installs if missing) the following packages.
# Having each package on a line like this allows for commenting to explain what/why each package is used

p_load(tidyverse, 
       rio, # package for easy import/export of data
       treeio, # package for input/output tree data
       ggtree) # visualising trees with syntax like ggplot (included in tidyverse)

While we are here, lets check we have installed / updated the packages needed for rmarkdown and knitting.

Run the following in your RStudio command line.

install.packages(c("rmarkdown", "knitr"))

We are ready to rock! We will actually not be starting in R but good on you for getting everything set up!

2 Interactive Tree of Life

2.1 Introduction

Firstly we aren’t going to start in RStudio!

The Interactive Tree of Life (iTOL) is a free web based interface to do phylogentic tree investigations and visulisations.

It is very powerful as a tool itself and can also overlay metadata but For the purposes of this lesson we will keep this simple to demonstrate the utility of iTOL for doing a quick look at a tree.

Open the iTOL homepage (https://itol.embl.de/)

You can create an account if you wish to save trees but this is not needed. Click on the Upload button in the top bar.

2.2 Uploading a Tree

Have a read of the page. There are a couple of options when uploading the tree. You can either Browse to your tree file or drag and drop it directly into the window.

You don’t need to worry about the different tree file formats. In my experience, Newick is the most predominant and what we will be using.

Note: There is some flexibility in the format of the filename extension that is uploaded. All tree files are basically a text file. iTOL (and R packages) will accept a few filetypes - it is what is inside that matters. Some tree files with be filename.txt or some will be filename.tree. This doesn’t matter to us.

Choose either method and upload Shigella_tree.txt.

The page will autoload if you drag and drop, otherwise press the upload button if you browsed and selected.

2.3 Trees in iTOL

You should now see the default view of a large tree. iTOL will default to a circle tree as this tends to be more appropriate for larger datasets. Smaller ones will default to a rectangular tree.

You can zoom in with a mouse scroll wheel and click and drag the tree around. Hovering your mouse over labels will provide some technical information (or metadata if present).

In the top right of the page you will see a control panel for the tree visualisation.

Try some of the different options in the Basic tab.

We won’t be looking at the Advanced or Datasets tab.

Most options in the Basic should be reasonably self explanatory and will control things like the size, font, position and alignment of the labels.

Convention and preferences may vary but many seem to prefer Label options -> Position -> At tips. So that it can be easier to get a sense of branches and where labels match up on trees. Often though, if you zoom in to parts of large trees with many leaves when doing this they may overlapp. You can change the Font Style to a smaller number (by defaulted measured in pixels ~ px) to see clearer.

Size of 10px (left) vs 4px (right) - note the labels will be different on your screen as they were changed to more easily generate and match epid data later.

2.4 Pruning a Tree

iTOL can also be useful to easily prune a tree to a subset that you are interested in. For example, you have a large dataset of SARS-CoV-2 sequences containing a couple of sublineages. You want to further analyse one lineage or an outbreak sitting on its own branch.

Prunning allows you to select a node / clade and all of the leaves contained within it.

Lets prune our large tree to just look at the isolates that (depending on if you changed the rotation or arc of the visualisation) is on the upper right part of the circle tree. Hovering over the long node/branch it should indicate that this contains 44 leaves.

Clicking this will show a menu for Clade functions.

Note: You can also click on labels to get Leaf functions in a similar way. This can often be useful to Copy node ID if you need to further examine a particular isolate in an associated linelisting or to copy this ID to highlight it through functions.

Press Add clade to pruned tree

This will highlight everything below in green and pop up a box with a summary of your pruning selection. You can add other branches if you need to include other clades or groups of isolates for analysis.

Press Prune tree

Now you should have a greatly reduced tree that likely looks very strange presented as a circle. Use the control panel to fix this.

Use the control panel to change the Mode -> Rectangular.

Presented now should be a style of tree you may have seen before. Similar you can add customisation here and prune further if need be but we will now use this to export this smaller subtree as a file.

2.5 Exporting a Tree

The final tab on the Control panel is the Export tab. Using this we can export the current tree. We will do this with this pruned tree to use this smaller tree in some R visualisations.

Press Export tab on the Control panel

The default format is SVG (a graphics picture format).

Change this to Newick tree under Text as below.

Leave the other options at their default. For the Filename we will choose a name to go with the R file we will work on.

Set the file name option to subtree and press Export

This should automatically prompt you to save the file somewhere or it will download automatically to your “Downloads” folder (or wherever you set this) depending on your browser settings.

!!IMPORTANT!! Make sure you save/move this file to the same folder as the LFF_response.rmd file.

This concludes the iTOL section and now we will move to RStudio.

3 Trees in R

Move to your RStudio session and open the LFF_response.rmd file to work in and write your responses into.

You may get a notice or banner at the top to install some necessary packages such as knitr or rmarkdown, please press install if asked.

3.1 A Brief Introduction to Rmarkdown

The file you are working in is an Rmarkdown document. You may notice instead of every line being code with some comments like an R script. This one allows you to write a report and the code is separated into chunks.

Firs thing you may notice is the block of code at the top. This header is basically the instructions to RStudio on how to compile the document and some basic information.

Edit your name (within the quote marks) into the header

---
title: "LFF: Visualising Phylogentic Trees"
author: "Your Name Here" # Edit your name into here
date: "2023-04-03"
output: html_document
---

The output option says what format to compile the document into with the most common being html_document, pdf_document and word_document. You don’t need to edit these in. When you choose to Knit the document, these options will be automatically added.

This document you are reading now is a rendered html document, with a few more options added (such as to get the table of contents).

Rmarkdown is an excellent way to blend your analysis with associated notes or documentation. Think of it like your data analysis lab journal. It allows you to perform chunks of analysis and rapily run and rerun chunks of code.

A chunk of R code in an .rmd document

Yes. The technical term for these blocks of code is a chunk.

As above you can see that the area of code is defined by three back ticks ```{r} followed by your code and closed with another three backticks ```.

This allows you to just run the code within that section by pressing the green play arrow in the top right corner. The code is executed and the results are displayed in line within the document itself instead of in the R console. This is also nice to be able to go between parts of your analysis and keep results visible without needing to rerun a potentially lengthy script.

The button in the middle with the downarrow and green line will run all code chunks above this chunk. This is very useful to run all earlier datacleaning or analysis that the chunk you are working in depends on. Either for when you come back to your document or to “refresh” your data to run your current chunk newly again.

The left button in this corner allows you to set chunk options, but you will be very unlikely to need this.

The top right of your overall .rmd document window has the Run option that pressing on also gives options about running chunks or individual lines of code.

You can either copy paste a chunk set up or type manually the text to create a code chunk. Alternatively, you can use the keyboard shortcut: ctrl + alt + i. You can try this in LFF_response.rmd.

There will be some questions for you to answer during this part of the exercise. This will be signified in the response document by the following:

<div class="response">
Type your response in here.
</div>

This is so your responses will stand out when looking at your final document. You can go over multiple lines, just make sure the first line has the <div class="response"> and the final line has </div>.

3.2 Importing Our Tree

From here, much of the information may be similar or duplicated between this document and LFF_response.rmd. As you will run the code in there and follow along with the following outputs to get the same results!

Follow along below and in your response document

Lets import the subtree we created in the iTOL section. At a basic import. It doesn’t look like much when using the base R plotting as well as a basic gg plot from ggtree.

tree <- read.tree("subtree.txt") # Import our tree file

plot(tree) # Default plotting using base R

ggtree(tree) # Default plotting using ggtree

ggtree(tree, layout = "circular") # By default ggtree will plot rectangular, use this to plot circular

You will also notice that when you run a chunk with multiple plots like this, they will tab themselves within the document. In this webpage they show up separately with each code line between as when documents are knitted, outputs are presented as they go.

You can also press the left most button in the top right corner to show this in a new window

Answer the following question in your response document.

What are the differences between each of the tree in R and with the visualisation we had in iTOL?

The default ggtree does not include much since it requires you to be more explicit about what you want from it. The gg syntax can be summarised as your base data for your plot and then you add on (with a +) other visual pieces (geom’s: geom ~ geomtry) to it.

p <- ggtree(tree) + geom_tiplab(size = 2) # add on tip labels to your plot with the given size

p # since the above line creates the 'p' object, we need to print it by writing it here

Note: When asigning a plot to a variable (p above), we need to then state it so it prints into the document. Because we are working in an rmarkdown document, you will notice it doesn’t show up in the usual RStudio plot window. To do this, you can just type p enter into your R console since this is stored in your environment. You can try it now and see the plot show up in your plot window.

3.3 Importing and Attaching Data

First we import our data. This is very straightforward as we are using the packed rio (for R Input/Output). It will detect the file type and import appropriate (most of the time).

epidata <- import("subtree_epidata.csv")

Simple!

Lets have a look at what is in our epidata. You can do this by just clicking on the object in the top right panel of RStudio. Click on epidata and it will view it in the main RStudio panel.

Sample_ID Residence Travel Resistance Sex
Sample_178 NSW International Drug A Female
Sample_179 NSW International Drug A Female
Sample_180 VIC International Drug A Male
Sample_181 VIC Unknown
Sample_182 VIC Unknown
Sample_183 ACT Unknown
Sample_184 ACT State Drug C Male
Sample_185 ACT Drug C Male
Sample_186 ACT Male
Sample_187 VIC Male
Sample_188 VIC Male
Sample_189 VIC Male
Sample_190 VIC Drug C Male
Sample_191 QLD Female
Sample_192 QLD Female
Sample_193 QLD Male
Sample_194 QLD State Male
Sample_195 QLD Male
Sample_196 QLD Female
Sample_197 NSW Drug B Unknown
Sample_198 NSW Drug B Female
Sample_199 NSW Drug B Female
Sample_200 NSW Drug B Female
Sample_201 NSW State Drug B Male
Sample_202 NSW State Drug B Female
Sample_203 NSW Drug B Female
Sample_204 NSW Drug B Male
Sample_205 NSW Drug B Male
Sample_206 QLD Drug A Female
Sample_207 QLD Drug A Female
Sample_208 QLD Drug A Male
Sample_209 QLD Male
Sample_210 VIC State Unknown
Sample_211 Laos Female
Sample_212 Laos Female
Sample_213 Laos Female
Sample_214 Laos Female
Sample_215 Vietnam Male
Sample_216 Vietnam Drug C Male
Sample_217 Vietnam Female
Sample_218 Vietnam Male
Sample_219 Vietnam Drug C Male
Sample_220 Vietnam Female
Sample_221 NSW State Unknown

We need to attach epi or meta data to our tree. We do this using the attacher operator: %<+%

Yes, it is a bit of a weird one. The general syntax is as follows

plotobject %<+% data

You can think of this like an assignment (<-) combined with the + because you are adding data in.

Important! The first column of your epidata needs contain the same labels as present in your tree object so that R knows which sequence is associated with which epidata entry.

This is how we do it here

p <- ggtree(tree) %<+% epidata #attach our epidata to the tree object and reassign (update it)

3.4 Visualising Our Data

Now the fun stuff!

We see that we have a few bits of data with the most complete being Residence. Our data set seems to be a bit odd in terms of location but it must make for an interesting group then.

Like we discussed before, we add the tip label to the plot object. This time, specifying that we want to fill based on Residence and apply this to the label geometry of our tree. This looks something like this:

p_res <- p + geom_tiplab(aes(fill = Residence), geom = "label", size = 2) 

p_res

Looking at this we can see some groupings and potential clades of samples based on the tree structure and the regions.

We can highlight clades of interest. This is done based on the internal node reference structure in the tree. The easiest way to identify these is to add the nodes to our plot for visual inspection. We do this by adding some text geometry and specify that we want to see the node labels and colour them so they are visible.

p_node <- p_res + geom_text(aes(label = node), size = 5, colour = "darkred")

p_node

On your response document you will definitely want to launch this in a new window or load p_node in your console and zoom in.

These node numbers are generated as a part of the tree file and can be referred to for colouring, rotating and general manipulation of a tree.

We can kind of see that there are a few distinct groupings and the parent node of these are 79 (arguably you could split this between the two states and Sample_210 is in an interesting position but we will keep this simple), 70 and 60.

Now we have identified these, we can apply some highlighting to these regions.

You will also see below that it can be helpful for managing your code, and so it is readable, to write over multiple lines. You can continue lines of code in things like plots by ending the line with a + and entering to the line below.

p_highlight <- p_res + 
  geom_highlight(node = 79, fill = "yellow") +
  geom_highlight(node = 70, fill = "lightblue")

p_highlight

Hmm, that doesn’t cover all the way across. We can manually extend the region, which does become a bit of an exercise in tuning these numbers. Which is why it is very helpful to be able to rerun code in a chunk!

p_highlight <- p_res + 
  geom_highlight(node = 79, fill = "yellow", extend = 0.01) +
  geom_highlight(node = 70, fill = "lightblue", extend = 0.015)

p_highlight

Your turn to add the final group at node 60.

In your response document, add the extra line of code, with extend if need be, for node 60

Lets go back to our clean tree and show another way to add epidata.

We can also add a variable to the end of the leaves of our tree. Lets do that for the Sex variable. This is using the geometry geom_tippoint as we are colouring in the points of each tip of the tree.

p_tip <- p + geom_tiplab(size = 2) + # To add the basic sampe_id back
  geom_tippoint(aes(color = Sex), size = 2)

p_tip

3.5 Adding Heatmaps - EXTENSION ONLY

Heatmaps can get reasonably complex when trying to add them to plots. In the R world there seem to be a few options with some newer packages being developed and actively worked on.

This is included just for reference and peoples interest. You can copy this code and run it but this is not expected.

This is challenging to automate as a part of regular reporting. These plots requires tuning of parameters to look nice and format properly when working on just one set of outbreak data (ask me how I know that~~~~~~)

p_load(ggnewscale) # We need another library for compilation

# Data preparation
epi_travel <- data.frame("Travel" = epidata$Travel)
rownames(epi_travel) <- epidata$Sample_ID
epi_drugres <- data.frame("DrugRes" = epidata$Resistance)
rownames(epi_drugres) <- epidata$Sample_ID

# Then we need to go through and add the heatmap layers iteratively

# First the travel information
p_heat <- gheatmap(p_res + new_scale_fill(), 
                   epi_travel,
                   colnames = FALSE,
                   width = 0.03,
                   offset = 0.005) +
  scale_fill_manual(name = "Travel", 
                    breaks = c("International", "State"), 
                    values = c("red", "blue"),
                    na.value = "white") 

# Now some drug resistance information
p_heat <- gheatmap(p_heat + new_scale_fill(),
                   epi_drugres,
                   colnames = FALSE,
                   width = 0.03,
                   offset = 0.01) +
  scale_fill_manual(name = "Drug Resistance",
                    breaks = c("Drug A", "Drug B", "Drug C"),
                    values = c("green", "purple", "orange"),
                    na.value = "white")

p_heat

4 Random Tree Exercise

The final section in the response document will generate a randomly shaped tree and random dataset for you to use to create a plot based on what you have learned during this session.

Below is a code chunk to generate a random tree and a random dataset.

set.seed is there to make the random chance repeatable. Make up a number yourself and replace the 1234 with it. This means that when you submit this to me, I will see the results you worked with, but everyone in our group will generate a different tree when they use a different number.

You can check this against the same seed in your response document to see. Then change the seed to a different number you make up.

set.seed(1234)
size = 10

random_tree <- rtree(size)

random_data <- data.frame("ID" = random_tree$tip.label,
                          "Location" = sample(c("QLD","NSW","VIC"), size, replace = TRUE),
                          "Sex" = sample(c("Male", "Female"), size, replace = TRUE),
                          "Travel" = sample(c("Overseas", ""), size, replace = TRUE, prob = c(0.2, 0.8)))

r <- ggtree(random_tree) + geom_tiplab()
r

I’m interested to see what you can make!

5 Submission

To submit your response to me you can either email the LFF_response.rmd document and/or a compiled html page that you can knit by pressing the Knit option at the top of your main RStudio window.

This will generate a .html page in the folder you were working in. Email that to me.

That’s it. You are done!